1 Introduction


Background 

One of the primary missions of the United States Army is to maintain a high
state of readiness so it can meet any challenges to national defense. To
accomplish this mission, the Army is constantly training soldiers for battle on
over 12 million acres of Department of Defense (DoD) lands. The Army is also
charged with the stewardship of the lands on which it conducts that training.
The Army uses Land Condition Trend Analysis (LCTA) as a means to inventory and
monitor natural resources. LCTA was developed by the U.S. Army Construction
Engineering Research Laboratory (CERL) under the sponsorship of the U.S. Army
Engineering and Housing Support Center (USAEHSC). It uses standardized methods
to collect, analyze, and report natural resources data (Diersing, Shaw, and
Tazik 1992) as part of the Army's Integrated Training and Management (ITAM)
program. An informal review of installation ITAM personnel indicated an
interest in estimating plant diversity using LCTA data and in modeling changes
in plant diversity that result from alternative land uses.

When a data set like LCTA is used to model the environment and to make
management decisions based on that modeling effort, it stands to reason that
good, accurate data should be used. The assumption that a data set used for any
mathematical or computer modeling is error-free is an underlying premise of
theoretical modeling. However, assumptions of error-free data and models
usually do not hold true in the real world. Error is a natural property of
surveys and modeling and, as such, should be taken into account when developing
any type of model.


Objective 

The objective of this project was to develop and test an error-budget model for
the population dynamics of plant communities using standard data from the
LCTA program at the White Sands Missile Range, New Mexico. Once developed,
this error-budget model can in turn be used for a number of other purposes such
as data correction, model evaluation, quality control, and management
decisionmaking.


Approach

Army installations have recognized that some degree of uncertainty exists in
the data used to make management decisions. Uncertainty and a wide range of
probable answers make land management subject to broad interpretation. Natural
resource personnel have identified the need for some method to distinguish
usable data from unusable data in the computer models that help them make
management decisions. One such pre-existing computer model was in place at
White Sands Missile Range (Cao et al. 2000). The authors evaluated the data set
used for this model and chose the model as a test case for implementing a
mathematical error-budget model. This error-budget model was developed with
input obtained through literature review, professional discussions, and
available field data.


Scope 

The error-budget model detailed in this report is designed to improve the use of
plant population models. The results of this study are specifically applicable
only to the White Sands Missile Range plant population model. By managing for
plant communities, DoD has the opportunity to conserve multiple species
simultaneously. Plant communities also provide a useful basis on which to
understand and manage the natural communities that support military training and
other land uses.

Within the context of the larger DoD mission, the use of an error-budget model
allows investigators to identify errors in methodology, sampling, data
collection and recording, modeling, and analysis. This process will allow
natural resource personnel to make more informed decisions as to what courses of
action are appropriate for a given management scenario. Better management of
natural resources at the installation level will lead to reduced restrictions on
the military mission.


Mode  of  Technology  Transfer 

The information in this report will be provided to Army personnel responsible for
assisting with natural resource management issues. The information will also
be provided to organizations responsible for developing and refining natural
resource conservation methodologies through hard copy reports and through the
CERL web site (www.cecer.army.mil).

The error budget for the plant population model included in this report is part of
a larger research effort that is developing protocols and tools to account for
uncertainty in natural resources modeling efforts and decisionmaking processes.
This broader research effort involves developing error budgets for a range of
natural resources models as a way of evaluating the uncertainty analysis tools
and protocols.


2 Error-Budget Modeling


Surveys of plant populations are an integral part of natural resource
management. When a survey of a population is completed, the results of the
survey are used by decisionmakers to make quantitative statements about the
population being studied. This quantitative information helps managers make
decisions or perform actions that will affect that population. Errors in these
statements can lead to erroneous decisions and actions that have the potential
to cause substantial losses to the species the natural resource managers are
charged with protecting. Thus, these errors should be carefully studied before
committing time and resources to any management project. An error-budget model
is used to trace the sources of error and their effects on the quantitative
statements that are made using sample data collected in the field. The
importance of an error-budget model cannot be overstated when dealing with
natural resources.

First, the error-budget model evaluates the quantitative statements made from a
survey. Given all the errors, the error-budget model can tell whether the
statements made are valid or invalid. A statement is valid only if its error is
within certain limits. If a statement's error does not fall within the accepted
limits, that statement will usually provide little useful information. Second,
the error-budget model can guide survey decisions. Using error sensitivity
analysis, all types of error sources can be tested to determine their effects on
the final statement. Knowing the effect each source of error has on a
statement, we can tailor a survey effort to control those error sources that
contribute the most to the final error. This is done based on the sensitivity
of the error sources. In this way we may obtain maximum accuracy with minimum
cost. Third, an error-budget model provides the information that can be used
for error correction. To correct errors, we must first know the sources of the
errors. Errors emanating from different sources may require different
procedures to correct them. Using error decomposition, we can determine the
major causes of the errors. Figure 1 shows the basic components of an
error-budget model.

[Figure 1 diagram: Error Budget Model for Plant Populations]


Figure  1.  A  conceptual  model  of  an  error  budget  for  a  plant  population  model. 
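
The three uses described above (checking validity, ranking error sources, and locating the sources to correct) can be illustrated with a minimal Python sketch. It assumes the variance contributed by each error source is already known; the function name `error_budget`, the source names, the numbers, and the validity limit are all hypothetical and are not taken from the report.

```python
def error_budget(contributions, limit):
    """Rank error sources by their share of the total variance.

    contributions: dict mapping a source name to its variance contribution
                   to the final statement (hypothetical numbers).
    limit:         largest total variance at which the statement is still
                   treated as valid (a user-chosen threshold).
    """
    total = sum(contributions.values())
    shares = {name: value / total for name, value in contributions.items()}
    # Largest contributors first: these are the sources worth controlling.
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return {"total": total, "valid": total <= limit, "ranked": ranked}


budget = error_budget(
    {"sampling": 4.0, "measurement": 1.0, "parameterization": 3.0},
    limit=10.0,
)
print(budget["valid"], budget["ranked"][0][0])
```

In this toy budget the sampling error accounts for half of the total variance, so it would be the first candidate for additional survey effort.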


Error  Sources 

Errors are classified into two basic categories: system error and survey error.
Distinctions between these two errors are based on their sources. System error
is a natural character of the system being modeled. It is determined by the
system itself. There are two types of system errors: demographic noise and
environmental noise (Gotelli 1998). Demographic noise (or within-individual
variability) is the variation between individuals who are apparently identical
but have different life spans and produce different numbers of offspring.
Stochastic models are typically used to investigate the consequences of
demographic noise. Environmental noise is so termed because changes in the
environment vary unpredictably through time. These changes affect individuals
in different ways and at different times. The theory of stochastic processes can
be used to handle both types of system error.

Survey error is the deviation of any survey value from the true value. Survey
errors are generally divided into two types: sampling errors and nonsampling
errors. Sampling errors inherent in the survey design result from the conscious
choice to study a subset rather than the population as a whole. Efforts to
control sampling error are grounded in a well-developed theory, as are the
formulas and random selection techniques suitable to a particular problem that
falls within the context of the theory. Sampling errors are not the result of
mistakes per se, but mistakes in judgment when designing a sample may result in
larger errors.

Nonsampling errors encompass all the other things that contribute to survey
errors. Nonsampling errors are often thought of as being due entirely to
mistakes and deficiencies during the development and execution of the survey
procedures. These errors are said to arise from wrongly conceived definitions,
imperfections in the tabulation plans, misspecification errors,
misclassification errors, and so on. A perfect design would be free of
nonsampling errors. The following is a list of some error sources:

1. System error. This error is controlled by the system itself. The survey
   usually has little to do with it. Choosing appropriate theoretical models is
   essential to the modeling of the system errors. System errors include
   environmental noise and demographic noise.

2. Survey error. Survey error consists of sampling error and nonsampling error.

   a. Sampling error. Survey estimates are subject to sampling error because
      only a subset of the population is measured. Sampling error is caused by
      the heterogeneity of the population. This error is determined by the
      population distribution and sampling design.

   b. Nonsampling errors. These include modeling errors, measurement errors,
      and other errors.

      i. Modeling errors.

         • Simple models. When simple mathematical models are used to study a
           complicated population, investigators have only an approximate
           description of the population. For example, this type of error
           occurs when a linear model is used to approximate a nonlinear
           population.

         • Parameterization error. Parameters in the models are usually
           created by estimation. When estimated parameters are used, the
           results from the model may be quite different from those that
           result from theoretical parameters.

         • Projection errors. These errors include prediction error and
           recursion error. Prediction errors are those errors that occur when
           we use current model-based information to make a prediction about
           an unknown future. Recursion error is the error that is compounded
           by recursive use of models with error.

         • Misspecification errors. These errors occur when the model or
           model parameters are misspecified.

      ii. Measurement errors. As the name implies, measurement errors refer to
          the error incurred when the recorded value measured on a study
          variable differs from the true value. This error occurs during the
          data collection stage.

         • Instrument error. Tools for recording the values of study variables
           usually have limited precision. These tools may give inaccurate
           readings.

         • Observer's error. Observers with different backgrounds and
           training levels will report data with differing levels of accuracy.

         • Temporal and spatial errors. These errors occur when the study
           variable changes with time and place. For example, vegetation
           has seasonal changes.

         • Mistakes or recording errors. Errors of this type occur when
           researchers make mistakes reading instruments or recording the
           data. Misclassification of data is also considered a recording
           error.

      iii. Other errors.

         • Computation errors. These errors can be avoided with the accuracy
           of computations made by modern computers.

         • Errors due to catastrophe. These errors include lost or destroyed
           data sampling units.

         • Human errors. These errors include typing and editing errors,
           gaps in knowledge, subjective errors, and so on.


Error  Propagation  Method 

The following section (pp 11 through 20) is reprinted from Forest Ecology and
Management, Vol 71; George Gertner, Xiangchi Cao, and Huirong Zhu; "A quality
assessment of a Weibull based growth projection system;" pp 235-250; 1995,
with permission from Elsevier Science.

In developing an error budget for an inventory/survey system, the first step was
to select an appropriate method for determining the effects of errors in the
model. The method used was the error propagation method. Error propagation
has been used to estimate prediction variances in several models. Gertner (1987,
1988) used this method to determine the prediction variances of STEMS
(Belcher, Holdaway, and Brand 1982), a distance-independent growth projection
model for the north-central region of the United States. An error propagation
method was also used to develop some very simple error budgets for STEMS
(Gertner 1990a). Mowrer and Frayer (1986) and Mowrer (1988) used error
propagation to estimate prediction variances of several stand-level growth
models. In addition, Gertner (1990b) and Gertner and Kohl (1992) used error
propagation techniques to assess different inventory systems.

There were a number of reasons why the error propagation method was used for
developing the error budget:

• An error budget developed from error propagation is computationally
  efficient. Using a crude error propagation method to estimate final
  prediction variance, Gertner (1987, 1988) has shown that results comparable
  to those of the crude Monte Carlo method can be obtained at only a fraction
  of the computational cost. Error budgets based on error propagation will
  have similar computational efficiency.

• Except for testing purposes, high-quality independent data are not
  necessary for the construction of an error budget based on error propagation.
  This is true because error propagation methods determine the effect of
  errors on a model based on the initial properties of that model. Therefore,
  there is no need to use additional independent data.

• Once appropriate error propagation procedures are incorporated into a
  multi-component model, error budgets can be generated on-line and the
  prediction quality can be assessed routinely.

• During a simulation run, the bias and variance of each function in a
  model can be output regularly. This practice allows for constant
  monitoring of the accumulation of biases and variances.

To develop error budgets, the error propagation equations for bias, variance,
and covariance approximations were extended from those used by Gertner and
Mowrer. This was necessary due to the complexities of the combined
monitoring-projection system. Extension was also necessitated by the need for
very detailed error budgets to conduct the general quality assessments.
Because model curvature and the biases that result from it are a potential
problem (Gertner 1991), the assessment was conducted using a second-order Taylor
series. The propagation equations were developed to yield the biases,
variances, and covariances of each component of the model's parameters and
predictions as they pass through the system. Below is the theoretical
development of the error propagation equations used in developing the error
budgets.

Assuming an exact function $f$ is used to make predictions:

$$Y = f(B, X)$$

where $Y$ is a prediction made with the function, $X = (X_1, X_2, \ldots, X_n)$
is a vector of input variables, and $B = (b_1, b_2, \ldots, b_m)$ is a vector of
known parameters. $X$ is usually assumed to be error-free. Now suppose that,
instead of being error-free, the j-th component of $X$, $x_j$, has random error
$e_j$ (i.e., $x_j = X_j + e_j$, where the $e_j$s are independently distributed
with mean 0 and variance $V(e_j)$). Then the predicted $Y$ also has error due
to the errors of $X$. This error can be estimated by Taylor series expansion.
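
As a minimal illustration of the idea, the Python sketch below propagates independent input errors through a model using only the first-order term of the Taylor series, so that Var(Y) is approximated by the sum of (∂f/∂x_j)² V(e_j). It is a simplification of the second-order development that follows, not the report's implementation; the helper `propagate_variance`, the example `linear_model`, and all numbers are hypothetical.

```python
def propagate_variance(f, b, x, var_e, h=1e-6):
    """First-order Taylor approximation of Var(Y) for Y = f(b, x).

    Assumes independent, zero-mean errors e_j on the inputs x_j with
    variances var_e[j]; the parameters b are treated as known.
    """
    total = 0.0
    for j, v in enumerate(var_e):
        up = list(x)
        up[j] += h
        down = list(x)
        down[j] -= h
        # Central-difference estimate of the partial derivative df/dx_j.
        slope = (f(b, up) - f(b, down)) / (2.0 * h)
        total += slope ** 2 * v
    return total


def linear_model(b, x):
    # Hypothetical model: Y = b0 + b1*x0 + b2*x1
    return b[0] + b[1] * x[0] + b[2] * x[1]


var_y = propagate_variance(linear_model, [1.0, 2.0, 3.0], [10.0, 5.0], [0.5, 0.1])
print(var_y)
```

For this linear example the result should be close to 2² × 0.5 + 3² × 0.1 = 2.9, because a first-order expansion is exact for a linear model apart from floating-point error in the numerical derivative.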

Taylor  series  expansion  method 


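Assume any vector function involved has the Taylor series expansion
representation up to the second order:

$$u = F(t) = F(t_0) + \frac{\partial F(t_0)}{\partial t^T}(t - t_0)
  + \frac{1}{2}
  \begin{pmatrix}
    (t - t_0)^T \dfrac{\partial^2 F_1(t_0)}{\partial t\,\partial t^T}(t - t_0) \\
    \vdots \\
    (t - t_0)^T \dfrac{\partial^2 F_s(t_0)}{\partial t\,\partial t^T}(t - t_0)
  \end{pmatrix}
  + O(|t - t_0|^3) \tag{1}$$

where $u = (u_1, \ldots, u_s)^T$, $t = (t_1, \ldots, t_n)^T$,
$\partial F(t_0)/\partial t^T$ is the $s \times n$ matrix with $(i, j)$ entry
$\partial F_i(t_0)/\partial t_j$, and
$\partial^2 F_i(t_0)/\partial t\,\partial t^T$ is the $n \times n$ matrix with
$(j, k)$ entry $\partial^2 F_i(t_0)/\partial t_j\,\partial t_k$.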


If $T$ is a random vector and $\varepsilon$ is a random error vector with
$E[\varepsilon] = 0$, then

$$U - \varepsilon = F(T) = F(T_0) + \frac{\partial F(T_0)}{\partial t^T}(T - T_0)
  + \frac{1}{2}(T - T_0)^T \frac{\partial^2 F(T_0)}{\partial t\,\partial t^T}(T - T_0)
  + O_p(|T - T_0|^3). \tag{2}$$


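If $E[T] = T_0$ is true, then from Equation (2) the following can be obtained:

$$E[U] = E[F(T)] \approx F(T_0)
  + \frac{1}{2} E\left[(T - T_0)^T \frac{\partial^2 F(T_0)}{\partial t\,\partial t^T}(T - T_0)\right]
  = F(T_0) + \frac{1}{2}
  \begin{pmatrix}
    \mathrm{Trace}\left[\dfrac{\partial^2 F_1(T_0)}{\partial t\,\partial t^T}\,\Sigma_T\right] \\
    \vdots \\
    \mathrm{Trace}\left[\dfrac{\partial^2 F_s(T_0)}{\partial t\,\partial t^T}\,\Sigma_T\right]
  \end{pmatrix}. \tag{3}$$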


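Assuming $\varepsilon$ is independent of $T$, then

$$\Sigma_U = \Sigma_\varepsilon + \mathrm{Cov}[F(T)]
  \approx \Sigma_\varepsilon
  + \frac{\partial F(T_0)}{\partial t^T}\,\Sigma_T
    \left(\frac{\partial F(T_0)}{\partial t^T}\right)^T. \tag{4}$$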
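
The second-order mean shift of Equation (3) and the first-order covariance of Equation (4) can be checked numerically with a short Python sketch. It assumes the Jacobian and Hessians have already been evaluated at the expansion point; the function name `second_order_moments` and the example model F(t) = t1 × t2 are hypothetical choices made only for illustration.

```python
def second_order_moments(jacobian, hessians, sigma_t, sigma_eps):
    """Approximate the mean shift and covariance of U = F(T) + eps.

    jacobian:  s x n matrix dF/dt^T evaluated at T0
    hessians:  list of s (n x n) matrices d2F_i/dt dt^T evaluated at T0
    sigma_t:   n x n covariance matrix of T
    sigma_eps: s x s covariance matrix of eps
    """
    n = len(sigma_t)
    s = len(jacobian)
    # Equation (3): E[U_i] - F_i(T0) ~ 0.5 * Trace[H_i * Sigma_T]
    shift = [
        0.5 * sum(h[j][k] * sigma_t[k][j] for j in range(n) for k in range(n))
        for h in hessians
    ]
    # Equation (4): Sigma_U ~ Sigma_eps + J * Sigma_T * J^T
    cov = [
        [
            sigma_eps[a][c]
            + sum(
                jacobian[a][j] * sigma_t[j][k] * jacobian[c][k]
                for j in range(n)
                for k in range(n)
            )
            for c in range(s)
        ]
        for a in range(s)
    ]
    return shift, cov


# Hypothetical one-output model F(t) = t1 * t2 expanded at T0 = (2, 3).
J = [[3.0, 2.0]]                    # (dF/dt1, dF/dt2) = (t2, t1)
H = [[[0.0, 1.0], [1.0, 0.0]]]      # the mixed second derivative is 1
sigma_t = [[0.04, 0.01], [0.01, 0.09]]
shift, cov = second_order_moments(J, H, sigma_t, [[0.0]])
print(shift, cov)
```

For a product of two variables the exact mean is the product of the means plus their covariance, so the computed shift should match the off-diagonal entry of `sigma_t` (0.01 here). This is a convenient sanity check on the trace term.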


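Denote $\Sigma_{AB} = E[(A - E[A])(B - E[B])^T]$ and
$\Sigma_A = E[(A - E[A])(A - E[A])^T] = \mathrm{Cov}[A]$ for any random vectors
$A$ and $B$. If there are errors in the variables, for example, if the input
vector, instead of $T$, is actually measured as $\tau$, which possesses a bias
relative to $T$: $\mathrm{Bias}[\tau] = E[\tau - T] = E[\tau] - T_0$, then the
actual vector of variables,

$$v = F(\tau) + \varepsilon = \varepsilon + F(T_0)
  + \frac{\partial F(T_0)}{\partial t^T}(\tau - T_0)
  + \frac{1}{2}(\tau - T_0)^T \frac{\partial^2 F(T_0)}{\partial t\,\partial t^T}(\tau - T_0)
  + O_p(|\tau - T_0|^3), \tag{5}$$

has a bias in $U$:

$$\mathrm{Bias}[v] = E[v - U] \approx
  \frac{\partial F(T_0)}{\partial t^T}\,\mathrm{Bias}[\tau]
  + \frac{1}{2}
  \begin{pmatrix}
    \mathrm{Trace}\left[\dfrac{\partial^2 F_1(T_0)}{\partial t\,\partial t^T}
      \left(\Sigma_\tau + \mathrm{Bias}[\tau]\,\mathrm{Bias}[\tau]^T - \Sigma_T\right)\right] \\
    \vdots \\
    \mathrm{Trace}\left[\dfrac{\partial^2 F_s(T_0)}{\partial t\,\partial t^T}
      \left(\Sigma_\tau + \mathrm{Bias}[\tau]\,\mathrm{Bias}[\tau]^T - \Sigma_T\right)\right]
  \end{pmatrix}. \tag{6}$$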


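Since

$$E[v] = E[F(\tau)] \approx F(T_0)
  + \frac{\partial F(T_0)}{\partial t^T} E[\tau - T_0]
  + \frac{1}{2}
  \begin{pmatrix}
    \mathrm{Trace}\left[\dfrac{\partial^2 F_1(T_0)}{\partial t\,\partial t^T}
      E[(\tau - T_0)(\tau - T_0)^T]\right] \\
    \vdots \\
    \mathrm{Trace}\left[\dfrac{\partial^2 F_s(T_0)}{\partial t\,\partial t^T}
      E[(\tau - T_0)(\tau - T_0)^T]\right]
  \end{pmatrix}, \tag{7}$$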


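the covariance matrix becomes

$$\Sigma_v = \Sigma_\varepsilon + \mathrm{Cov}[F(\tau)]
  \approx \Sigma_\varepsilon
  + E\left[\frac{\partial F(T_0)}{\partial t^T}
      (\tau - T_0 - \mathrm{Bias}[\tau])(\tau - T_0 - \mathrm{Bias}[\tau])^T
      \left(\frac{\partial F(T_0)}{\partial t^T}\right)^T\right]
  = \Sigma_\varepsilon
  + \frac{\partial F(T_0)}{\partial t^T}\,\Sigma_\tau
    \left(\frac{\partial F(T_0)}{\partial t^T}\right)^T. \tag{8}$$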


Error  evaluation  for  nonlinear  regression  system 


Assuming the input-output system is:

$$Y = f(B, X) + \varepsilon, \tag{9}$$

where $X = (X_1, \ldots, X_n)^T$ and $Y = (Y_1, \ldots, Y_s)^T$ are respectively
the vectors of input and output variables, $B = (B_1, \ldots, B_m)^T$ is the
parameter vector, and $\varepsilon$ is the random error vector, which is
independent of $(B, X)$ and satisfies $E[\varepsilon] = 0$.

Since the vector function $f(B, X)$ is non-linear, it is often difficult to
calculate the mean and covariance matrix for the output $Y$, even if it is
known that $E[X] = \xi$, $E[B] = \beta$,

$$\mathrm{Cov}[X, X] = E[(X - \xi)(X - \xi)^T] = \Sigma_X =
  \begin{pmatrix}
    \sigma^2_{X_1} & \cdots & \sigma_{X_1 X_n} \\
    \vdots & \ddots & \vdots \\
    \sigma_{X_n X_1} & \cdots & \sigma^2_{X_n}
  \end{pmatrix},$$

$\mathrm{Cov}[B, B] = \Sigma_B$, $\mathrm{Cov}[\varepsilon, \varepsilon] =
\Sigma_\varepsilon$, and $\mathrm{Cov}[B, X] = \Sigma_{BX}$. However, a
second-order Taylor series expansion can be used to approximate the covariance
matrix for the output $Y$. Define, for $i = 1, \ldots, s$,


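$$f_i^0 = f_i(\beta, \xi), \qquad
  f_{iX} = \frac{\partial f_i(\beta, \xi)}{\partial X}
         = \left(\frac{\partial f_i(\beta, \xi)}{\partial X_1}, \ldots,
                 \frac{\partial f_i(\beta, \xi)}{\partial X_n}\right)^T, \qquad
  f_{iB} = \frac{\partial f_i(\beta, \xi)}{\partial B}
         = \left(\frac{\partial f_i(\beta, \xi)}{\partial B_1}, \ldots,
                 \frac{\partial f_i(\beta, \xi)}{\partial B_m}\right)^T,$$

$$f_{iXX} = \frac{\partial^2 f_i(\beta, \xi)}{\partial X\,\partial X^T}, \qquad
  f_{iXB} = \frac{\partial^2 f_i(\beta, \xi)}{\partial X\,\partial B^T}, \qquad
  f_{iBB} = \frac{\partial^2 f_i(\beta, \xi)}{\partial B\,\partial B^T},$$

where $f_{iXX}$ is the $n \times n$ matrix with $(j, k)$ entry
$\partial^2 f_i(\beta, \xi)/\partial X_j\,\partial X_k$, $f_{iXB}$ is the
$n \times m$ matrix with $(j, k)$ entry
$\partial^2 f_i(\beta, \xi)/\partial X_j\,\partial B_k$, and $f_{iBB}$ is the
$m \times m$ matrix with $(j, k)$ entry
$\partial^2 f_i(\beta, \xi)/\partial B_j\,\partial B_k$,

and denote:

$$\Delta(B, X) = \max_{i,j}\left[\,|X_i - \xi_i|,\ |B_j - \beta_j|\,\right].$$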

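The second-order Taylor series expansion of $f$ is:

$$f_i(B, X) = f_i^0 + (X - \xi)^T f_{iX} + f_{iB}^T (B - \beta)
  + \frac{1}{2}\left[(X - \xi)^T f_{iXX} (X - \xi)
  + 2 (X - \xi)^T f_{iXB} (B - \beta)
  + (B - \beta)^T f_{iBB} (B - \beta)\right]
  + O_p(\Delta^3(B, X)). \tag{10}$$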


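From Equation (10) the following approximation can be obtained:

$$\eta_i = E[Y_i] = E[f_i(B, X)] + E[\varepsilon_i]
  \approx f_i^0 + E[X - \xi]^T f_{iX} + f_{iB}^T E[B - \beta]
  + \frac{1}{2} E\left[(X - \xi)^T f_{iXX} (X - \xi)
  + 2 (X - \xi)^T f_{iXB} (B - \beta)
  + (B - \beta)^T f_{iBB} (B - \beta)\right]$$

$$= f_i^0 + \frac{1}{2}\left(\mathrm{Trace}[f_{iXX}\,\Sigma_X]
  + 2\,\mathrm{Trace}[f_{iXB}\,\Sigma_{BX}]
  + \mathrm{Trace}[f_{iBB}\,\Sigma_B]\right). \tag{11}$$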


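Similarly,

$$\sigma_{Y_i Y_j} = \mathrm{Cov}[Y_i, Y_j] = E[Y_i Y_j] - \eta_i \eta_j
  \approx \sigma_{\varepsilon_i \varepsilon_j}
  + f_{iX}^T \Sigma_X f_{jX}
  + f_{iB}^T \Sigma_{BX} f_{jX}
  + f_{jB}^T \Sigma_{BX} f_{iX}
  + f_{iB}^T \Sigma_B f_{jB}. \tag{12}$$

In the calculations, all terms of the order $E[\Delta^3(B, X)]$ or higher were
omitted from the assessment. Also, since
$E[\Delta^4(B, X)] - E[\Delta^2(B, X)]^2 \ge 0$, the term involving products of
traces of the two covariance matrices, which has the order of
$E[\Delta^2(B, X)]^2 \le E[\Delta^4(B, X)]$, was also omitted.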


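Because there can be errors in the actual input $x$ and estimated parameter $b$,
with $E[x] = \xi + \mathrm{Bias}[x]$, $E[b] = \beta + \mathrm{Bias}[b]$,
$\mathrm{Cov}[x, x] = \Sigma_x$, $\mathrm{Cov}[b, b] = \Sigma_b$, and
$\mathrm{Cov}[b, x] = \Sigma_{bx}$, the actual output should be:

$$y = f(b, x) + \varepsilon \tag{13}$$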


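Substituting $(B, X)$ with $(b, x)$ into Equation (2) and taking the expectation,
the following is obtained (as above, omit all terms of the order
$E[\Delta^3]$, where $\Delta = \max[\Delta(B, X), \Delta(b, x)]$):

$$E[y_i] = E[f_i(b, x)] + E[\varepsilon_i]
  \approx f_i^0 + \mathrm{Bias}[x]^T f_{iX} + f_{iB}^T\,\mathrm{Bias}[b]
  + \frac{1}{2}\Big(\mathrm{Trace}[f_{iXX}(\Sigma_x + \mathrm{Bias}[x]\,\mathrm{Bias}[x]^T)]
  + 2\,\mathrm{Trace}[f_{iXB}(\Sigma_{bx} + \mathrm{Bias}[b]\,\mathrm{Bias}[x]^T)]
  + \mathrm{Trace}[f_{iBB}(\Sigma_b + \mathrm{Bias}[b]\,\mathrm{Bias}[b]^T)]\Big), \tag{14}$$

$$\mathrm{Bias}[y_i] = E[y_i] - \eta_i
  \approx \mathrm{Bias}[x]^T f_{iX} + f_{iB}^T\,\mathrm{Bias}[b]
  + \frac{1}{2}\Big(\mathrm{Trace}[f_{iXX}(\Sigma_x + \mathrm{Bias}[x]\,\mathrm{Bias}[x]^T - \Sigma_X)]
  + 2\,\mathrm{Trace}[f_{iXB}(\Sigma_{bx} + \mathrm{Bias}[b]\,\mathrm{Bias}[x]^T - \Sigma_{BX})]
  + \mathrm{Trace}[f_{iBB}(\Sigma_b + \mathrm{Bias}[b]\,\mathrm{Bias}[b]^T - \Sigma_B)]\Big). \tag{15}$$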


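The covariance between $y_i$ and $y_j$ is (note that
$E[\Delta]^2 \le E[\Delta^2]$ and $E[\Delta^2]^2 \le E[\Delta]\,E[\Delta^3]$, so
$E[\Delta]\,E[\Delta^2] \le E[\Delta^3]$):

$$\sigma_{y_i y_j} = E[y_i y_j] - E[y_i]\,E[y_j]
  \approx \sigma_{\varepsilon_i \varepsilon_j}
  + f_{iX}^T \Sigma_x f_{jX}
  + f_{iB}^T \Sigma_{bx} f_{jX}
  + f_{jB}^T \Sigma_{bx} f_{iX}
  + f_{iB}^T \Sigma_b f_{jB}. \tag{16}$$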


In an iterating system like the inventory system, the output serves as the input
for the next year. To evaluate the transition error, the covariance between $b$
and $y$ is needed, $\Sigma_{by} = \mathrm{Cov}[b, y]$, which is calculated from:

$$\mathrm{Cov}[b, y_i] = E[b\,y_i] - E[b]\,E[y_i]
  \approx \Sigma_{bx} f_{iX} + \Sigma_b f_{iB}. \tag{17}$$

Error transition in iterative system

In the development of the iterative system, it was assumed that the models were
properly specified and calibrated, such that $\mathrm{Bias}[b] = 0$ and
$E[b] = \beta$. At the beginning, it was assumed that there were only random
measurement errors in the initial input $x^{(0)} = X^{(0)}$, i.e.,
$\mathrm{Bias}[x^{(0)}] = 0$ and $E[x^{(0)}] = \xi^{(0)}$. Then for the initial
input $x^{(0)}$ and theoretical parameter $B = \beta$ (non-random constant), the
output is:

$$X^{(1)} = f(\beta, x^{(0)}) + \varepsilon^{(1)} \tag{18}$$

and for the same initial input but estimated parameter $b$, the output becomes:

$$x^{(1)} = f(b, x^{(0)}) + \varepsilon^{(1)} \tag{19}$$

Assuming the data set used to estimate the parameter is independent of
$x^{(0)}$ and $\mathrm{Cov}[b, x^{(0)}] = 0$, the bias is as follows:

$$\mathrm{Bias}[x_i^{(1)}] = E[x_i^{(1)} - X_i^{(1)}]
  \approx \frac{1}{2}\,\mathrm{Trace}[f_{iBB}^{(0)}\,\Sigma_b]. \tag{20}$$


It can be seen that the bias in the output is created in a one-step transition.
For a well fitted model, it can be expected that
$f_{iBB} = \partial^2 f_i/\partial B\,\partial B^T$ is zero or is sufficiently
small, so the bias can be negligible. But for a model with large $f_{iBB}$, it
is necessary to consider the effects of the bias on future projections, since
each projection uses the current output as the next input. The k-th actual
output is:

$$x^{(k)} = f(b, x^{(k-1)}) + \varepsilon^{(k)}. \tag{21}$$

The theoretical output is:

$$X^{(k)} = f(\beta, X^{(k-1)}) + \varepsilon^{(k)}. \tag{22}$$

The bias is approximately equal to:

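$$\mathrm{Bias}[x_i^{(k+1)}] = E[x_i^{(k+1)}] - E[X_i^{(k+1)}]
  \approx \mathrm{Bias}[x^{(k)}]^T f_{iX}^{(k)}
  + \frac{1}{2}\,\mathrm{Trace}\left[f_{iXX}^{(k)}\left(\Sigma_x^{(k)}
      + \mathrm{Bias}[x^{(k)}]\,\mathrm{Bias}[x^{(k)}]^T - \Sigma_X^{(k)}\right)\right]
  + \mathrm{Trace}[f_{iXB}^{(k)}\,\Sigma_{bx}^{(k)}]
  + \frac{1}{2}\,\mathrm{Trace}[f_{iBB}^{(k)}\,\Sigma_b], \tag{23}$$

where a superscript $(k)$ on a derivative indicates evaluation at step $k$, and
$\Sigma_x^{(k)}$ and $\Sigma_X^{(k)}$ are calculated through

$$\sigma_{x_i x_j}^{(k)} = \mathrm{Cov}[x_i^{(k)}, x_j^{(k)}]
  \approx \sigma_{\varepsilon_i \varepsilon_j}
  + f_{iX}^{(k-1)T} \Sigma_x^{(k-1)} f_{jX}^{(k-1)}
  + f_{iB}^{(k-1)T} \Sigma_{bx}^{(k-1)} f_{jX}^{(k-1)}
  + f_{jB}^{(k-1)T} \Sigma_{bx}^{(k-1)} f_{iX}^{(k-1)}
  + f_{iB}^{(k-1)T} \Sigma_b f_{jB}^{(k-1)}, \tag{24}$$

$$\sigma_{X_i X_j}^{(k)} = \mathrm{Cov}[X_i^{(k)}, X_j^{(k)}]
  \approx \sigma_{\varepsilon_i \varepsilon_j}
  + f_{iX}^{(k-1)T} \Sigma_X^{(k-1)} f_{jX}^{(k-1)}
  + f_{iB}^{(k-1)T} \Sigma_B f_{jB}^{(k-1)}, \tag{25}$$

and

$$\mathrm{Cov}[b, x_i^{(k)}] \approx \Sigma_{bx}^{(k-1)} f_{iX}^{(k-1)}
  + \Sigma_b f_{iB}^{(k-1)}. \tag{26}$$

At the first step, since $\Sigma_{bx}^{(0)} = 0$, these terms are simplified as

$$\sigma_{x_i x_j}^{(1)} \approx \sigma_{\varepsilon_i \varepsilon_j}
  + f_{iX}^{(0)T} \Sigma_x^{(0)} f_{jX}^{(0)}
  + f_{iB}^{(0)T} \Sigma_b f_{jB}^{(0)}, \tag{27}$$

$$\mathrm{Cov}[b, x_i^{(1)}] \approx \Sigma_b f_{iB}^{(0)}. \tag{28}$$

Further steps follow by iterating these equations. In this way the growth of
the error can be approximated at each step.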
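
The iteration can be made concrete with a deliberately simple scalar model, x_k = b × x_(k-1) + ε_k, for which the derivatives are f_X = β, f_B = ξ, f_XB = 1, and f_XX = f_BB = 0. The Python sketch below follows the structure of Equations (23), (24), and (26) as they are read here; the model, the function name `iterate_error`, and the numbers are hypothetical and serve only to show how bias, variance, and Cov[b, x] accumulate from step to step.

```python
def iterate_error(beta, var_b, xi0, var_x0, var_eps, steps):
    """Track bias, variance and Cov[b, x] for x_k = b * x_(k-1) + eps_k.

    For this scalar model the derivatives at (beta, xi) are
    f_X = beta, f_B = xi, f_XB = 1 and f_XX = f_BB = 0.
    """
    xi, bias, var_x, cov_bx = xi0, 0.0, var_x0, 0.0
    history = []
    for _ in range(steps):
        f_x, f_b = beta, xi
        # Equation (23): only the f_X and f_XB terms survive for this model.
        new_bias = bias * f_x + cov_bx
        # Equation (24): variance of the next actual output.
        new_var = (var_eps + f_x * var_x * f_x
                   + 2.0 * f_b * cov_bx * f_x + f_b * var_b * f_b)
        # Equation (26): covariance between b and the next output.
        new_cov = cov_bx * f_x + var_b * f_b
        xi = beta * xi              # theoretical mean path used for expansion
        bias, var_x, cov_bx = new_bias, new_var, new_cov
        history.append((bias, var_x, cov_bx))
    return history


history = iterate_error(beta=1.1, var_b=0.01, xi0=100.0,
                        var_x0=4.0, var_eps=1.0, steps=2)
for step, (bias, var_x, cov_bx) in enumerate(history, start=1):
    print(step, bias, var_x, cov_bx)
```

In this example the first step produces no bias, because the estimated parameter is unbiased and independent of the initial input. The second step does, because the first output is now correlated with b. That is the transition effect the section describes.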


Misclassification 

Situations in which discrete variables are measured with error are called
misclassifications. Classification is the process of dividing objects or items
into mutually exclusive groups, such that the members of each group are as
"close" as possible to one another, and different groups are as "far" as
possible from one another. The distance is measured with respect to the
specific variable(s) or properties one is trying to predict. In the visible
world, objects are characterized by their properties. These properties are
usually measured by discrete variables or categorized continuous variables.
Based on their measures, objects are classified into classes. In an error-free
world, every object belongs to its right class. In the real world, however,
objects may be measured with error and placed into wrong classes. These objects
are considered to be misclassified. Misclassification does not change the total
number of objects or items, but it changes the distribution of objects among
the classes. For example, let $O = (o_1, \ldots, o_{10})$ be the set of 10
objects, $P = (p_1, \ldots, p_k)$ be the $k$ properties of each object in $O$,
and $C = (c_1, c_2, c_3)$ be the set of 3 classes. Without error, the three
classes contain 2, 4, and 4 of the objects, respectively. When the objects are
measured with error, the same 10 objects may instead be divided into classes of
3, 3, and 4 objects.
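
The closed-system property described above (totals are conserved while the distribution among classes changes) can be sketched in Python with a misclassification matrix. The matrix values below are hypothetical error rates chosen for illustration, and `expected_observed_counts` is an illustrative helper, not a method taken from the report.

```python
def expected_observed_counts(true_counts, confusion):
    """Redistribute true class counts through a misclassification matrix.

    confusion[i][j] is the assumed probability that an object whose true
    class is i gets recorded in class j; every row must sum to 1.
    """
    for row in confusion:
        assert abs(sum(row) - 1.0) < 1e-9, "each row must sum to 1"
    n_classes = len(true_counts)
    return [
        sum(true_counts[i] * confusion[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]


true_counts = [2, 4, 4]             # ten objects in three classes
confusion = [                       # hypothetical error rates
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
]
observed = expected_observed_counts(true_counts, confusion)
print(observed, sum(observed))
```

Because each row of the matrix sums to one, the expected total stays at ten objects even though the expected class sizes shift.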

There are many causes of misclassification, such as inaccurate measurements of
objects, incomplete information about the objects, or human mistakes. The
results of a classification are used to make statements about the objects being
studied. They can also provide information for decisionmaking.
Misclassification errors can lead to an incorrect decision, thereby causing
substantial losses. Thus, classification errors should be carefully studied.

Errors in measurement not only cause larger variance, they also may produce
bias in the results (Gertner, Cao, and Zhu 1995). Methods of dealing with
measurement errors have been proposed and successfully used in applications
like error propagation (Gelb et al. 1974; Gertner, Cao, and Zhu 1995) and error
approximation (Gertner 1987). These methods, however, cannot be applied in
instances of misclassification because of its property as a closed system. The
problem of misclassification has been considered from different viewpoints by
many investigators. To adjust for misclassification, Tenebein (1979) proposed a
double sampling scheme for binomial data. Chen (1979, 1989) gave a review of
methods for misclassified categorical data and the maximum likelihood
estimation for loglinear models. Geng (1989), York (1992), and Viana (1994)
applied Bayesian estimation methods to the problem of misclassification and
incomplete data. In this study, we will discuss two approaches to modeling
misclassification: likelihood function methods and Bayesian estimation methods.
We will apply these methods to the estimation of biodiversity with
misclassifications.

Likelihood  function  method 

In  some  systems,  objects  are  distributed  in  theoretic  patterns.  The  distribution 
of  objects  in  these  systems  can  be  described  precisely  in  a  mathematical  fashion. 




ERDC/CERL  TR-00-12 


Also, when data or information are not sufficient to make statistical conclusions
about the distribution of the objects studied, assumptions of theoretic
distributions must be made. Among all the theoretic distribution functions, we
find that the beta function has the highest flexibility to model a wide variety
of distribution types.

Magnussen and Boyle (1995) use a beta function as an a-priori likelihood
function to represent the most probable species abundance distributions
(MOPSAD). Shannon and Simpson indices are calculated by using MOPSAD. The
method of using MOPSAD considers the variations due to sampling. Based on this
model, we propose a similar approach for estimating diversity indices with
misclassification.

Suppose we have the following beta function (also called a likelihood function)
as a species abundance distribution:

$$L(p_s \mid \alpha, \beta) = \frac{p_s^{\alpha - 1}(1 - p_s)^{\beta - 1}}{B(\alpha, \beta)} \tag{29}$$


$L(p_s \mid \alpha, \beta)$ is the likelihood of species $s$ having relative abundance $p_s$; $\alpha$ and $\beta$ are the two parameters of the beta function; and $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the complete beta function. With different combinations of $\alpha$ and $\beta$, the likelihood function $L(p_s)$ gives different types of curves. Figure 2 displays four typical types of MOPSAD priori for plant communities, which Magnussen and Boyle (1995) call inverse J-shaped, J-shaped, bell-shaped, and flat priori. The inverse J-shaped distribution usually represents communities in which there are many rare species and very few dominant species. In contrast, the J-shaped curve illustrates communities dominated by a few principal species, in which one is unlikely to find many rare species. The bell-shaped species distribution appears in many montane temperate forests, where the two extremes (dominated entirely by rare species or by only a few common species) are rare. The flat priori distribution represents an even distribution of species.

[Figure 2: four panels, each with relative abundance ($p_s$) from 0 to 1 on the horizontal axis. (a) Inverse J-shaped, $L(p_s \mid 0.5, 3.0)$; (b) J-shaped, $L(p_s \mid 3.0, 0.5)$; (c) bell-shaped, $L(p_s \mid 2.0, 2.0)$; and (d) a flat priori, $L(p_s \mid 1.0, 1.0)$.]

Figure 2. Four typical types of species abundance distributions.
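
As a rough illustration of Equation (29), the following Python sketch evaluates the beta likelihood for the four parameter pairs shown in Figure 2. The names `beta_likelihood` and `SHAPES` are ours and do not come from the report; the complete beta function is computed through the standard log-gamma identity.

```python
import math

def beta_likelihood(p, alpha, beta):
    """Equation (29): p^(alpha-1) * (1-p)^(beta-1) / B(alpha, beta), for 0 < p < 1."""
    # log B(alpha, beta) = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp((alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p) - log_b)

# Parameter pairs (alpha, beta) quoted for the four MOPSAD shapes in Figure 2.
SHAPES = {
    "inverse J-shaped": (0.5, 3.0),
    "J-shaped": (3.0, 0.5),
    "bell-shaped": (2.0, 2.0),
    "flat priori": (1.0, 1.0),
}

# Print each curve at a low, middle, and high relative abundance.
for name, (alpha, beta) in SHAPES.items():
    curve = [round(beta_likelihood(p, alpha, beta), 3) for p in (0.1, 0.5, 0.9)]
    print(f"{name:17s} {curve}")
```

Evaluating each curve at only three abundances is enough to see the qualitative shapes: the inverse J-shaped curve falls with abundance, the J-shaped curve rises, the bell-shaped curve peaks in the middle, and the flat priori is constant.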


In this study, we use the Shannon index as an example for the error analysis of plant diversity. The Shannon index combines species richness and species evenness into a single number (Shannon and Weaver 1949); its value is determined by both the number of species and the distribution of abundance among them. The formula for the Shannon index is

(30)  $H = -\sum_{i=1}^{s} p_i \times \log(p_i)$,

where $H$ is the Shannon index, $p_i$ is the relative abundance of species $i$, and $s$ is the total number of species.
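
A minimal Python sketch of Equation (30) follows. The function name `shannon_index` and the example counts are hypothetical; the natural logarithm is used, and species with zero counts are skipped because they contribute nothing to the sum.

```python
import math

def shannon_index(counts):
    """Equation (30): H = -sum(p_i * log(p_i)), with p_i = count_i / total."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical plot with three species.
clean = [50, 30, 20]
# Same plot after one individual is entered under a mistyped code ("new species").
miscoded = [50, 30, 19, 1]

print(shannon_index(clean), shannon_index(miscoded))
```

In this made-up example a single miscoded individual adds a fourth "species" and raises the index, which is the effect of data-entry errors discussed later in this section.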

The expected Shannon index for a plant community with a beta distribution representing MOPSAD is found by summing over all possible relative species abundances ($0 < p_s < 1$), with each summand given a weight equal to its likelihood $L(p_s)$. Magnussen and Boyle (1995) give the conditional expectation of the Shannon index:

(31)  $E(H \mid \alpha, \beta) = \dfrac{\int_0^1 -p_s \log(p_s)\, L(p_s)\, dp_s}{\int_0^1 p_s\, L(p_s)\, dp_s} = \dfrac{\int_0^1 -p_s^{\alpha} (1 - p_s)^{\beta-1} \log(p_s)\, dp_s}{B(\alpha, \beta) \times \alpha/(\alpha+\beta)}$

where $\int_0^1 p_s \times L(p_s)\, dp_s = \alpha/(\alpha+\beta)$ is the mean species abundance for the chosen MOPSAD and log is the natural log function.


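Based on Equation (31), we give the following conditional variance of $H$ given $\alpha$ and $\beta$:

(32)  $Var(H \mid \alpha, \beta) = \int_0^1 \left[ \left( -p_s \times \log(p_s) \times \dfrac{\alpha+\beta}{\alpha} \right) - E(H \mid \alpha, \beta) \right]^2 \times L(p_s)\, dp_s$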


Without misclassification error, Equations (31) and (32) give the expectation and variance of the Shannon index (Equation 30). Figure 3 shows the expected Shannon index with $\alpha$ and $\beta$ ranging from 0.5 to 10.
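
The two integrals can be checked numerically. The sketch below is ours, not the authors' C program: it uses a simple midpoint rule and assumes that the summand $-p_s \log(p_s)$ is divided by the mean abundance $\alpha/(\alpha+\beta)$ in both the expectation and the variance. Under that reading it gives values close to the flat priori and bell-shaped entries of Table 1 (means 0.5 and 0.5833, variances 0.0461 and 0.0257).

```python
import math

def beta_likelihood(p, alpha, beta):
    """Beta likelihood of Equation (29) for 0 < p < 1."""
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp((alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p) - log_b)

def shannon_moments(alpha, beta, steps=100_000):
    """Midpoint-rule estimate of the conditional mean and variance of H."""
    mean_abundance = alpha / (alpha + beta)
    width = 1.0 / steps
    expectation = 0.0
    second_moment = 0.0
    for i in range(steps):
        p = (i + 0.5) * width                       # midpoint of the i-th cell
        weight = beta_likelihood(p, alpha, beta) * width
        term = -p * math.log(p) / mean_abundance    # scaled Shannon summand
        expectation += term * weight
        second_moment += term * term * weight
    # Variance written as E[term^2] - E[term]^2, the expanded form of the
    # squared-deviation integral when the likelihood integrates to one.
    return expectation, second_moment - expectation ** 2

for name, (a, b) in {"flat priori": (1.0, 1.0), "bell-shaped": (2.0, 2.0)}.items():
    mean, var = shannon_moments(a, b)
    print(name, round(mean, 4), round(var, 4))
```

The midpoint rule converges slowly when $\alpha$ or $\beta$ is below 1, because the likelihood is then unbounded at an endpoint; a more careful quadrature would be needed for the J-shaped and inverse J-shaped cases.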

Misclassification can occur for many reasons in a plant species survey. The major sources of species misclassification are incorrectly identifying species, recording in the wrong catalogue, miscoding species, and using poor-quality specimens. Incorrect identification is related to weather, season, and the observer's background in plant study; with the guidance of experts, this error can be very small. Recording errors occur very often among species with similar codes, such as ABCD1 and ABCD2. When there are many species with similar codes in one plot, the recording errors can be very large, contributing the most to the total misclassification. Miscoding usually happens at data entry: incorrectly typing a code results in a “new species,” which can alter the Shannon index significantly. Specimens of unknown species are often collected from the field for further identification. If the specimens are not handled properly, or are stored for too long, accurate identification becomes difficult.

Figure 3. The expected Shannon index with alpha and beta ranged from 0.5 to 10.

If only misclassification is considered, the error in absolute species abundance is linearly related to the error in relative species abundance. This is because the population is a closed system when only misclassification error is considered: misclassification does not change the total population; it only changes the distribution among species, so the error in the total population due to misclassification is zero.

Let $p_s$ be the relative abundance of species $s$ without misclassification, $p_s'$ be the relative abundance of species $s$ with misclassification, and $e_s$ be the misclassification error in relative species abundance. We have

(33)  $p_s' = p_s + e_s$


where $p_s$ is beta distributed. Because the population is a closed system, $p_s'$ must also be beta distributed. This property makes it difficult to find a distribution for the error term $e_s$. Misclassification actually just shifts the beta distribution from $L(p_s \mid \alpha, \beta)$ to $L(p_s' \mid \alpha', \beta')$, where $\alpha$ and $\beta$ are the parameters of the beta function without misclassification, and $\alpha'$ and $\beta'$ are the parameters of the beta function with misclassification. Instead of finding the distribution for the error term $e_s$, we look for the relationship between the error and the distribution shift. For example, a 20% misclassification error in species abundance implies that the area between the $L(p_s \mid \alpha, \beta)$ and $L(p_s' \mid \alpha', \beta')$ curves is 20%. In general, we have

(34)  $\int_0^1 \left| L(p_s \mid \alpha, \beta) - L(p_s' \mid \alpha', \beta') \right| dp_s = d$


where $d$ is the misclassification error as a percentage. Thus, the conditional expectation of the Shannon index $E(H \mid d)$ is calculated from the sample curves of $(\alpha', \beta' \mid d)$. The bias and variance due to misclassification are given by

(35)  $bias(H) = E(H \mid d) - E(H \mid \alpha, \beta)$

      $Var(H) = Var(H \mid \alpha, \beta) + Var(H \mid d)$

where $E(H \mid \alpha, \beta)$ and $E(H \mid d)$ are the expectations of the Shannon index without misclassification and with $d$% input misclassification, respectively, and similarly for the variances $Var(H \mid \alpha, \beta)$ and $Var(H \mid d)$. With this sign convention, a positive bias means that misclassification inflates the index, as reported in Tables 2 through 4.
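
The link between an input error $d$ and a shift in the beta parameters can be explored numerically. The sketch below is an assumption-laden illustration rather than the authors' procedure: `area_between` is a hypothetical helper that estimates the area between two beta likelihood curves with a midpoint rule and returns it as a fraction (0.20 for the 20% example).

```python
import math

def beta_likelihood(p, alpha, beta):
    """Beta likelihood of Equation (29) for 0 < p < 1."""
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp((alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p) - log_b)

def area_between(params, shifted_params, steps=20_000):
    """Midpoint-rule estimate of the integral of |L(p|a,b) - L(p|a',b')| over (0, 1)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * width
        total += abs(beta_likelihood(p, *params) - beta_likelihood(p, *shifted_params))
    return total * width

# Area between the flat priori and the bell-shaped curve.
print(round(area_between((1.0, 1.0), (2.0, 2.0)), 4))
```

One way to use such a helper would be to search for shifted parameters whose area from the original curve matches a target $d$; how the authors sampled those shifted curves is not stated here, so that step is left out.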

A C program was written to model the errors of the Shannon index due to misclassification. The program ran on a LENA SUPERCOMPUTER of the National Center for Supercomputing Applications (NCSA) in Champaign, Illinois. In the simulation, we took $(\alpha, \beta)$ in the likelihood function as (1.0, 1.0), (2.0, 2.0), (3.0, 0.5), and (0.5, 3.0) for the flat priori, bell-shaped, J-shaped, and inverse J-shaped curves, respectively. The results from the program were used to generate the tables discussed in the following paragraphs.

Tables 1 through 4 show the bias and variance of the Shannon index for the four typical types of species abundance distributions. In Table 1 the Shannon index is estimated from Equations (31) and (32), which is the case without misclassification. The error in Table 1 is mainly due to the natural variation of species distribution, which can be used to determine sampling design. The bias and error in Tables 2 through 4 account only for misclassification, at two input error limits of 10% and 20%. Comparing Tables 1 and 2, we can see that random misclassification does not contribute much to the variance, but it does produce bias even if the input data are unbiased.

Table 2 shows the bias and variance of the Shannon index due to random misclassification. From this table we can see that different species distributions have different sensitivities to misclassification. The flat priori and bell-shaped distributions are the most resistant. For the flat priori, 10% input misclassification gives a bias of about 1.76%, part of which may be due to rounding error in the computation of the complete beta function $B(\alpha, \beta)$; with larger input error (20%), misclassification causes about 6.77% bias in the Shannon index. One reason for the lower sensitivity of the flat priori and bell-shaped distributions is that the chances of incorrectly classifying rare species as common species and vice versa are roughly equal, so the effects of misclassification cancel each other out. In the J-shaped case, the chance of incorrectly classifying rare species as common species is less than that of incorrectly classifying common species as rare species because




there are few rare species in a community dominated by common species. Misclassification may create “new species” or increase the evenness of the species distribution, which causes a systematic increase in the Shannon index: with 10% input misclassification the Shannon index increases about 5%, and with 20% it increases about 12%. In the inverse J-shaped case, by contrast, the Shannon index decreases with misclassification, by about 4.7% with 10% input error and 18% with 20% input error. Because there are many rare species and very few dominant species in this case, some rare species are more likely to be overlooked. This reduces the number of species and causes the Shannon index to decrease.

Tables 3 and 4 give the bias and variance of the Shannon index due to systematic misclassification. As an example, we suppose the bias of the input error is linear in its Shannon index, $E(H \mid d) = E(H \mid \alpha, \beta) + w \times E(H \mid d)$. As shown in Tables 3 and 4, both the bias and variance of the Shannon index are larger with systematic misclassification. Table 3 illustrates the case in which more common species are misclassified as rare species, or species are misidentified as “new species.” That is, the misclassification increases the evenness of the species distribution or the number of species, which causes the Shannon index to increase. Table 4 demonstrates the case in which more rare species are misclassified as common species. This reduces the evenness of the species distribution and causes a decrease in the Shannon index.


Table 1.  Shannon index without misclassification.

Model         Flat priori   Bell-shaped   J-shaped   Inverse J-shaped
Mean          0.5           0.5833        0.1388     1.4132
Variance      0.0461        0.0257        0.0165     0.0723
Error¹ (%)    42.8945       27.4938       92.59      60.1770

¹ Error (%) = (standard deviation)/(true index) * 100


Table 2.  Bias and variance of Shannon index due to random misclassification.

Model         Flat priori               Bell-shaped               J-shaped                  Inverse J-shaped
Input error   10%          20%          10%          20%          10%          20%          10%          20%
Mean          0.5088       0.5338       [illegible]  [illegible]  0.1458       0.1553       1.347        1.1529
Variance      0.0022       0.0095       [illegible]  [illegible]  0.0001       0.0003       0.0711       0.0818
Error (%)     9.4453       19.5184      7.7545       16.6964      6.5929       12.4139      18.8735      20.2347
Bias          0.0088       0.0338       0.0052       0.0247       0.007        0.0165       -0.0662      -0.2603
Bias (%)      1.7573       6.7682       0.8895       4.2423       5.0707       11.9189      -4.6878      -18.4224

Table 3.  Bias and variance of Shannon index due to weighted misclassification of common species as rare species.

Model         Flat priori               Bell-shaped               J-shaped                  Inverse J-shaped
Input error   10%          20%          10%          20%          10%          20%          10%          20%
Mean          0.5597       0.6406       0.6473       0.7297       [illegible]  0.1864       1.4816       1.3834
Variance      0.0027       0.0137       0.0025       0.0137       [illegible]  0.0004       0.0861       0.1197
Error (%)     10.3899      23.4221      8.53         20.0357      7.2522       15.0208      20.7608      24.484
Bias          0.0597       0.1406       0.064        0.1464       0.0216       0.0476       0.0684       -0.0298
Bias (%)      11.933       28.1219      10.9784      25.0907      15.5778      34.3027      4.8434       -2.1069

Table 4.  Bias and variance of Shannon index due to weighted misclassification of rare species as common species.

Model         Flat priori               Bell-shaped               J-shaped                  Inverse J-shaped
Input error   10%          20%          10%          20%          10%          20%          10%          20%
Mean          0.463        0.4484       0.5355       0.5108       0.1327       0.1305       1.2257       0.9684
Variance      0.0018       0.0067       0.0017       0.0067       [illegible]  0.0002       0.0589       0.0577
Error (%)     8.5953       16.3955      7.0566       14.025       [illegible]  [illegible]  [illegible]  16.9972
Bias          -0.037       -0.0516      [illegible]  -0.0725      -0.0061      -0.0083      -0.1875      -0.4448
Bias (%)      -7.4009      -10.3147     -8.1906      -12.4365     -4.3857      -5.9881      -13.2659     -31.4748

Bayesian  estimation  method 

A basic assumption of the likelihood function method is that the species abundance distribution follows a beta distribution. The distribution of real-world species abundance may not follow any theoretical distribution. To adjust for misclassification in such a case, a misclassification probability or a double sampling scheme is used. The following methods are adapted from Viana (1994) and Geng (1989).

Let us consider binomial data first; the extension from binomial to multinomial data is straightforward. Let $x = (x_1, x_2)$ be the observed binomial data subject to misclassification. Let $p' = (p_1', p_2')$ and $p = (p_1, p_2)$ be the corresponding observed and true probability distributions, respectively, and $M$ be the 2 x 2 matrix of classification error probabilities, so that $p' = M p$, or

$\begin{pmatrix} p_1' \\ p_2' \end{pmatrix} = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} \begin{pmatrix} p_1 \\ p_2 \end{pmatrix}$

where $m_{ij}$ is the conditional probability of observed state $i$ given the true state $j$, so that each column of $M$ sums to one. There are two cases, depending on whether the misclassification probability matrix $M$ is known or unknown. We discuss a method for each case.

Cases where M is known. When $M$ is known, estimates of the true classification probability $p$ are obtained from

(37)  $p = M^{-1} p'$

under the condition that $M^{-1}$ exists. In practice, we have the matrix $M$ and data $x = (x_1, x_2)$, and we wish to know the expectation and variance of $p$ estimated from $M$ and $x$. Viana (1994) gave the posterior density of $p$, given $x$ and $M$.
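
Equation (37) can be sketched for the 2 x 2 case as follows. The function names and the example matrix are hypothetical. The sketch assumes that entry `m[i][j]` is the probability of observing state `i + 1` when the true state is `j + 1`, so that each column sums to one and the observed distribution equals the matrix product of `m` and the true distribution.

```python
def observed_from_true(m, p):
    """Matrix-vector product p' = M p for a 2 x 2 matrix given as nested lists."""
    return [m[0][0] * p[0] + m[0][1] * p[1],
            m[1][0] * p[0] + m[1][1] * p[1]]

def true_from_observed(m, p_obs):
    """Equation (37): p = M^{-1} p', valid only when M is invertible."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("M is singular, so Equation (37) does not apply")
    inverse = [[m[1][1] / det, -m[0][1] / det],
               [-m[1][0] / det, m[0][0] / det]]
    return observed_from_true(inverse, p_obs)

# Hypothetical error rates: 10% of state 1 is recorded as state 2,
# and 20% of state 2 is recorded as state 1.
M = [[0.9, 0.2],
     [0.1, 0.8]]
p_observed = observed_from_true(M, [0.7, 0.3])
print(p_observed, true_from_observed(M, p_observed))
```

With observed relative frequencies in place of exact probabilities, the inverted estimate can fall outside the interval from 0 to 1, which is one motivation for working with the posterior density instead.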


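The posterior density $f(p \mid x, M)$ is a weighted sum of beta density functions given by:

(38)  $f(p \mid x, M) = \sum_{r \in \Re} \omega_r\, L(p;\, a + \sum_u r_u)$

where

$\omega_r = W_r / \sum_{r \in \Re} W_r$,

$W_r = B(a + \sum_u r_u) \prod_{u,v} \dfrac{m_{uv}^{r_{uv}}}{r_{uv}!}$,

$L(p; a) = p_1^{a_1 - 1} p_2^{a_2 - 1} / B(a)$,

$a = (a_1, a_2)$ is the parameter vector of the beta prior on $p$, $r_u$ is row $u$ of $r$, and $\Re$ is the set of all matrices $r$ with nonnegative integer entries $r_{uv}$ such that $\sum_v r_{uv} = x_u$ with $u$ = 1 or 2.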
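
The weighted sum can be evaluated by direct enumeration when the counts are small. The sketch below is our own illustration under stated assumptions, not code from Viana (1994): `m[u][v]` is taken as the probability of observed state `u` given true state `v`, every entry of `m` is assumed strictly positive, and the weights follow from multiplying a beta prior by the binomial likelihood of the observed counts and expanding the powers.

```python
import math

def log_beta(a1, a2):
    """Logarithm of the complete beta function B(a1, a2)."""
    return math.lgamma(a1) + math.lgamma(a2) - math.lgamma(a1 + a2)

def posterior_mean_p1(x, m, a=(1.0, 1.0)):
    """Posterior mean of p1 for observed counts x = (x1, x2) and known M.

    a = (a1, a2) are beta prior parameters; the flat default is an assumption.
    """
    n = x[0] + x[1]
    log_weights, means = [], []
    for r00 in range(x[0] + 1):          # observed as state 1, truly state 1
        for r10 in range(x[1] + 1):      # observed as state 2, truly state 1
            r = ((r00, x[0] - r00), (r10, x[1] - r10))
            c1 = r00 + r10               # individuals truly in state 1
            c2 = n - c1                  # individuals truly in state 2
            log_w = log_beta(a[0] + c1, a[1] + c2)
            for u in range(2):
                for v in range(2):
                    log_w += r[u][v] * math.log(m[u][v]) - math.lgamma(r[u][v] + 1)
            log_weights.append(log_w)
            means.append((a[0] + c1) / (a[0] + a[1] + n))   # mean of Beta(a1+c1, a2+c2)
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]    # rescaled to avoid underflow
    return sum(w * mu for w, mu in zip(weights, means)) / sum(weights)

print(posterior_mean_p1((7, 3), ((0.9, 0.2), (0.1, 0.8))))
```

The number of matrices in the sum grows as the product of (count + 1) over the observed classes, so exhaustive enumeration is practical only for small samples.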

If the observed data $x = (x_1, \ldots, x_k)$ is a multinomial vector, the posterior probability density $f(p \mid x, M)$ is a weighted sum of Dirichlet density functions (Viana 1994):

(39)  $f(p \mid x, M) = \sum_{r \in \Re} \omega_r\, D(p;\, a + \sum_u r_u)$

where $\omega_r$ and $\Re$ are the $k$-dimensional versions of their definitions given above, and $D(p;\, a + \sum_u r_u)$ is the Dirichlet density function with the vector of parameters $(a + \sum_u r_u)$.

Cases where M is unknown. When $M$ is unknown, a double sampling scheme is used to calculate the posterior means of the classification probabilities. In double sampling, we first observe individuals using a cheap but error-prone method. We then categorize a random subsample using a precise but expensive method to adjust for misclassification.

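Let us consider a 2 x 2 table with two variables. In the discussion we use the notation of Geng (1989):

$A, B$: error-free variables;

$a, b$: error-prone variables;

$\{N_{++jk}\}$: observed frequencies of the main sample;

$\{n_{hijk}\}$: those of the subsample;

$\{p_{hijk}\}$: cell probabilities;

$\{p_{hi|jk}\}$: conditional probabilities given $a = j$ and $b = k$;

$\{\alpha_{hijk}\}$: parameters of the Dirichlet density of $\{p_{hijk}\}$.

In this notation, “+” denotes a summation over the index. Indices $h$, $i$, $j$, and $k$ denote variables $A$, $B$, $a$, and $b$, respectively. Suppose the observations are multinomial data and the prior density of $\{p_{hijk}\}$ is a Dirichlet density with parameters $\{\alpha_{hijk}\}$:

(40)  $Di(\{p_{hijk}\} \mid \{\alpha_{hijk}\}) = \dfrac{\Gamma(\alpha_{++++})}{\prod_{h,i,j,k} \Gamma(\alpha_{hijk})} \prod_{h,i,j,k} p_{hijk}^{\alpha_{hijk} - 1}$

Geng (1989) gave the following joint posterior density of $\{p_{++jk}\}$ and $\{p_{hi|jk}\}$, and the posterior means and variances of $\{p_{hijk}\}$:

(41)  $f(\{p_{++jk}\}, \{p_{hi|jk}\} \mid \{N_{++jk}\}, \{n_{hijk}\}) = Di(\{p_{++jk}\} \mid \{\alpha_{++jk} + N_{++jk} + n_{++jk}\}) \prod_{j,k} Di(\{p_{hi|jk}\} \mid \{\alpha_{hijk} + n_{hijk}\})$

(42)  $E(p_{hijk}) = \dfrac{(\alpha_{++jk} + N_{++jk} + n_{++jk})(\alpha_{hijk} + n_{hijk})}{(\alpha_{++++} + N_{++++} + n_{++++})(\alpha_{++jk} + n_{++jk})}$

(43)  $Var(p_{hijk}) = \dfrac{(\alpha_{++jk} + N_{++jk} + n_{++jk})(\alpha_{++jk} + N_{++jk} + n_{++jk} + 1)}{(\alpha_{++++} + N_{++++} + n_{++++})(\alpha_{++++} + N_{++++} + n_{++++} + 1)} \times \dfrac{(\alpha_{hijk} + n_{hijk} + 1)(\alpha_{hijk} + n_{hijk})}{(\alpha_{++jk} + n_{++jk} + 1)(\alpha_{++jk} + n_{++jk})} - [E(p_{hijk})]^2$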
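
To make the double sampling idea concrete, the sketch below computes posterior means of cell probabilities for a simplified case with one error-free variable (index `h`) and one error-prone variable (index `j`). It is our own illustration under stated assumptions, not a transcription of Geng (1989): the prior is Dirichlet with parameters `alpha[h][j]`, `main_counts[j]` holds individuals classified only by the cheap method, `sub_counts[h][j]` holds subsample individuals classified by both methods, and the two samples are assumed not to overlap.

```python
def posterior_cell_means(alpha, main_counts, sub_counts):
    """Posterior means of cell probabilities p[h][j] under double sampling."""
    true_classes = range(len(alpha))
    observed_classes = range(len(main_counts))
    alpha_col = [sum(alpha[h][j] for h in true_classes) for j in observed_classes]
    sub_col = [sum(sub_counts[h][j] for h in true_classes) for j in observed_classes]
    grand_total = sum(alpha_col) + sum(main_counts) + sum(sub_col)
    means = []
    for h in true_classes:
        row = []
        for j in observed_classes:
            # Posterior mean of the observed-class margin, using both samples.
            margin = (alpha_col[j] + main_counts[j] + sub_col[j]) / grand_total
            # Posterior mean of P(true h | observed j), using the subsample only.
            conditional = (alpha[h][j] + sub_counts[h][j]) / (alpha_col[j] + sub_col[j])
            row.append(margin * conditional)
        means.append(row)
    return means

# Hypothetical data: flat prior, 100 cheaply classified individuals,
# and 30 individuals classified by both methods.
means = posterior_cell_means(
    alpha=[[1, 1], [1, 1]],
    main_counts=[60, 40],
    sub_counts=[[18, 2], [2, 8]],
)
true_class_1 = means[0][0] + means[0][1]   # estimated share truly in class 1
print(means, true_class_1)
```

Summing a row of the result gives the estimated share of a true class after adjusting for misclassification, which is the quantity a diversity index would then be computed from.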


Error Budget and Sensitivity Analysis

Suppose we have a discrete stochastic system without input control:

$x_{k+1} = A_k x_k + G_k w_k$

$y_k = C_k x_k + H_k v_k$,

where $x_k \in R^n$, $w_k \in R^g$, $v_k \in R^h$; $A_k$, $G_k$, $C_k$, and $H_k$ are possibly time-varying, known matrices of the appropriate dimensions; and $x$ and $y$ are the state and the observation, respectively. The basic random variables $\{x_0, w_0, \ldots, v_0, \ldots\}$ are all independent and Gaussian with $x_0 \sim N(0, \Sigma_0)$, $w_k \sim N(0, Q)$, $v_k \sim N(0, R)$. The covariances are all known. The available information at time $k$ is $y^k = (y_0, \ldots, y_k)$; the random variables $x_k$, $x_{k+1}$, and $y^k$ are jointly Gaussian. Denote

$p_{k|k}(x_k \mid y^k) \sim N(x_{k|k}, \Sigma_{k|k})$, and

$p_{k+1|k}(x_{k+1} \mid y^k) \sim N(x_{k+1|k}, \Sigma_{k+1|k})$,

$\tilde{x}_{k+1|k} = x_{k+1} - x_{k+1|k}$,

$y_{k|k-1} = E\{y_k \mid y^{k-1}\}$, and

$\tilde{y}_{k|k-1} = y_k - y_{k|k-1}$
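
The estimation model of the system is

(44)  $x_{k+1|k+1} = A_k x_{k|k} + L_{k+1} (y_{k+1} - C_{k+1} A_k x_{k|k})$,  $x_{0|0} = L_0 y_0$,

      $\Sigma_{k+1|k} = A_k \Sigma_{k|k} A_k^T + G_k Q G_k^T$,

      $\Sigma_{k|k} = (I - L_k C_k) \Sigma_{k|k-1}$,  $\Sigma_{0|0} = (I - L_0 C_0) \Sigma_0$,

where $L_k = \Sigma_{k|k-1} C_k^T [C_k \Sigma_{k|k-1} C_k^T + H_k R H_k^T]^{-1}$

      $L_0 = \Sigma_0 C_0^T [C_0 \Sigma_0 C_0^T + H_0 R H_0^T]^{-1}$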


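The n-step prediction model of the system is

(45)  $x_{k+n|k} = \left( \prod_{i=0}^{n-1} A_{k+i} \right) x_{k|k}$

(46)  $\Sigma_{k+n|k} = \left( \prod_{i=0}^{n-1} A_{k+i} \right) \Sigma_{k|k} \left( \prod_{i=0}^{n-1} A_{k+i} \right)^T + \sum_{i=0}^{n-1} \left( \prod_{j=i+1}^{n-1} A_{k+j} \right) G_{k+i} Q G_{k+i}^T \left( \prod_{j=i+1}^{n-1} A_{k+j} \right)^T$

where each product is ordered with the latest matrix on the left, $\prod_{i=0}^{n-1} A_{k+i} = A_{k+n-1} \cdots A_{k+1} A_k$, and an empty product is the identity matrix. Equation (46) is the one-step covariance prediction applied $n$ times.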
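
For intuition, the filter and the n-step prediction can be written out for a scalar state and a scalar observation. The sketch below is a simplified illustration with constant coefficients, not the report's model; all function and parameter names are ours. The n-step prediction is obtained by repeating the one-step prediction, which shows how the system noise accumulates in the predicted variance.

```python
def kalman_update(x_est, sigma_est, y_next, a, g, c, h, q, r):
    """One filter step for scalar x and y: returns the updated estimate and variance."""
    x_pred = a * x_est                                   # one-step state prediction
    sigma_pred = a * sigma_est * a + g * q * g           # one-step variance prediction
    gain = sigma_pred * c / (c * sigma_pred * c + h * r * h)
    x_new = x_pred + gain * (y_next - c * x_pred)        # correct with the innovation
    sigma_new = (1.0 - gain * c) * sigma_pred
    return x_new, sigma_new

def predict_n_steps(x_est, sigma_est, n, a, g, q):
    """n-step prediction by repeating the one-step prediction n times."""
    x_pred, sigma_pred = x_est, sigma_est
    for _ in range(n):
        x_pred = a * x_pred
        sigma_pred = a * sigma_pred * a + g * q * g
    return x_pred, sigma_pred

# Hypothetical numbers: one update followed by a 10-step prediction.
x1, s1 = kalman_update(0.0, 1.0, 2.0, a=1.0, g=1.0, c=1.0, h=1.0, q=0.1, r=1.0)
print(x1, s1, predict_n_steps(x1, s1, 10, a=1.0, g=1.0, q=0.1))
```

With `a = 1` the predicted variance grows by `g * q * g` at every step, which mirrors the way system error comes to dominate long-horizon predictions.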

If $x_k$ represents the vector of species abundance at time $k$, the Shannon index can be calculated from the vector of species abundance $x_k$.

Let us rewrite the formula of the Shannon index:

(47)  $H = -\sum_{i=1}^{s} p_i \times \log(p_i)$

where $p_i = x_i / \sum_{j=1}^{s} x_j$, $x_i$ is the abundance of species $i$ in $x_k$, and $s$ is the number of species. Because $H$ is a nonlinear function of $x_k$, we can calculate its error by the Taylor series expansion method as shown below.

Assume an exact function $f$ is used to make predictions:

(48)  $Y = f(B, X)$

where $Y$ is a prediction made with the function, $X = (X_1, X_2, \ldots, X_n)$ is a vector of input variables, and $B = (b_1, b_2, \ldots, b_m)$ is a vector of known parameters. $X$ is usually assumed to be error-free. Now suppose that, instead of being error-free, the $j$-th component of $X$, $X_j$, has random error $e_j$ (i.e., $X_j' = X_j + e_j$), where the $e_j$ are independently distributed with mean 0 and variance $V(e_j)$. Then the predicted $Y$ also has error due to the errors in $X$. This error can be estimated by Taylor series expansion.

Denote $f(x) = -x \times \log(x)$; the first- and second-order derivatives of $f(x)$ are

(49)  $\dfrac{d f(x)}{dx} = -\log(x) - 1$

(50)  $\dfrac{d^2 f(x)}{dx^2} = -\dfrac{1}{x}$

Applying (49) and (50) in the error propagation models (3) and (4), we have the expectation and variance of the Shannon index

(51)  $E(H) = -\sum_{i=1}^{s} p_i \times \log(p_i) + \sum_{i=1}^{s} E(e_i) \times (-\log(p_i) - 1) - \dfrac{1}{2} \sum_{i=1}^{s} \dfrac{Var(e_i)}{p_i}$

(52)  $Var(H) \approx \sum_{i=1}^{s} Var(e_i) \times (-\log(p_i) - 1)^2$

where $p_i = x_i / \sum_{j=1}^{s} x_j$.
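
The Taylor-series propagation can be sketched in a few lines of Python. The function below is a hedged illustration: it assumes independent errors in the relative abundances, takes their means and variances as inputs, and combines a second-order expansion for the expectation with a first-order expansion for the variance, using the derivatives in (49) and (50). The name `shannon_error_propagation` and the example numbers are ours.

```python
import math

def shannon_error_propagation(p, mean_err, var_err):
    """Approximate E(H) and Var(H) when each p_i carries an independent error e_i."""
    base = -sum(pi * math.log(pi) for pi in p)
    # First-order term: derivative (-log(p_i) - 1) times the mean error.
    first = sum(m * (-math.log(pi) - 1.0) for pi, m in zip(p, mean_err))
    # Second-order term: half the second derivative (-1/p_i) times the error variance.
    second = -0.5 * sum(v / pi for pi, v in zip(p, var_err))
    expectation = base + first + second
    variance = sum(v * (-math.log(pi) - 1.0) ** 2 for pi, v in zip(p, var_err))
    return expectation, variance

# Hypothetical community of three species with unbiased but noisy abundances.
p = [0.6, 0.3, 0.1]
expected_h, var_h = shannon_error_propagation(p, [0.0, 0.0, 0.0], [0.001, 0.001, 0.001])
print(expected_h, var_h)
```

Because the second-order term divides by $p_i$, rare species dominate the downward shift in the expected index, so the approximation is least reliable when some $p_i$ are close to zero.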


Common inventory errors in belt transect

Since 1989, the U.S. Army has delineated permanent core plots on over 50 military installations and training areas in the United States and Germany. The standard size of an LCTA permanent plot is 100 x 6 meters (600 m²) with a 100-m line transect forming the longitudinal axis. The plot inventory is conducted over a 2- to 3-month period during the peak of the growing season. The inventory consists of four major elements: land use assessment, line transect, belt transect, and wildlife sampling. The belt transect is intended to characterize species composition, density, and height distribution of woody and succulent vegetation.

The belt transect extends the width of the 100-m line transect. Although the belt has a standard width of 6 m, the width may be reduced at the field crew's discretion for high-density species. The field data are subsequently extrapolated to a standard 100 x 6 m plot during data analysis. Subject matter experts have suggested the following sources of inventory error associated with LCTA belt transect methods:

1. Instrument error. The major instrument error comes from locating the pole and tape positions. The tape may not run straight from one point to the other because of dense plants, and it may follow different paths in different years of data collection.

2. Observer's error. This error refers to differences in inventory results obtained by different observers on the same plot. One of the major differences is in the way each person counts clumps. Clumps are dense clonal patches of individual stems. Some observers count a clump as a single plant, while others count each branch of a clump as a separate plant.

3. Recording error. A common mistake is confusing species with very similar codes, such as ABCD1 and ABCD2, when recording. Recording errors also occur because codes are often truncated to reduce plot measurement time.

4. Species recognition error. Species are misidentified when the specimen is of poor quality or has characteristics similar to those of other species. Species recognition errors can also result from field crews with varying levels of training and experience.

5. Expansion factor error. Data from reduced belt transects must be extrapolated to a standard plot size of 100 x 6 m. Even for the same species in the same plot, different observers may use different belt widths. The data may also be expanded by the wrong factor when the expansion factor changes or its record is lost.

6. Editing error. Editing error refers to errors made in data entry. A common mistake is entering the wrong species code into the database, which creates a “new species.”
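
The expansion factor error in item 5 can be illustrated with a short sketch. It assumes that counts scale linearly with belt width, that is, uniform plant density across the belt; this scaling rule, the function names, and the example numbers are our assumptions rather than part of the LCTA protocol.

```python
STANDARD_BELT_WIDTH_M = 6.0   # standard belt width stated for LCTA plots

def expand_count(count, belt_width_m, standard_width_m=STANDARD_BELT_WIDTH_M):
    """Scale a count from a reduced belt to the standard belt, assuming uniform density."""
    if not 0 < belt_width_m <= standard_width_m:
        raise ValueError("belt width must be positive and no wider than the standard belt")
    return count * standard_width_m / belt_width_m

def expansion_error_percent(count, true_width_m, recorded_width_m):
    """Percent error in the expanded count when the wrong belt width is recorded."""
    correct = expand_count(count, true_width_m)
    wrong = expand_count(count, recorded_width_m)
    return (wrong - correct) / correct * 100.0

# Hypothetical case: 12 stems counted in a 2-m belt.
print(expand_count(12, 2.0))
# Hypothetical case: a full 6-m belt mistakenly recorded as a 2-m belt.
print(expansion_error_percent(12, 6.0, 2.0))
```

Because the factor multiplies every count for the affected species, a lost or wrong belt width produces a much larger error than a typical miscount.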

A summary of the subject matter experts' characterization of LCTA belt transect inventory error sources is provided in Table 5.

Table 5.  Suggested inventory error limits for belt transect.

Error sources               Lower bound   Upper bound
Locating poles and tapes    10            30
Counting clumps             20            40
Recording error             0             5
Species recognition         5             10
Temporal shift              0             40
Editing error               0             30
Expansion factor            0             200

An  example  of  error  budget  analysis 

We chose plant community type 6 from the White Sands Missile Range data set (Cao et al. 2000) as an example of error-budget analysis. Community type 6 covers plots 21, 22, 64, 138, 160, 164, and 167. The species distribution of this plant community type is similar to an inverse J-shaped beta distribution with alpha 0.5 and beta 3.0. We ran the error-budget model with two types of error limits: small input error and large input error. The results are shown in Tables 6 and 7. Percent bias, percent error, and total error are calculated as:


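$Bias\ \% = \dfrac{\text{index with error} - \text{true index}}{\text{true index}} \times 100\%$

$Error\ \% = \dfrac{\text{standard deviation}}{\text{true index}} \times 100\%$

$Total\ error = \sqrt{\sum_i (Error_i\ \%)^2}$, summed over the error sources $i$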


Tables 6 and 7 show the error of the Shannon index with small and large input errors for plant community type 6 at White Sands Missile Range. For the current estimate of the Shannon index, the major error is from misclassification, which is mainly due to misidentification of species, recording species under the wrong code, and mistyping species codes into the data set. The major measurement error is due to the method of estimating the number of stems in clumps of plants. More consistent measurement of clumps and quality control of data entry can largely reduce these types of errors. For the 10-year prediction, the system error and modeling error account for a larger component of the total error. With a good understanding of the system and system model, the system error is usually small. As more data are collected through continued monitoring, the modeling error can be reduced, and it can be ignored if the data set is large enough.
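
The total error rows of the error-budget tables are consistent with a root-sum-of-squares combination of the individual error sources, which is what one would expect if the sources are independent. The sketch below applies that rule to the Error % columns of Table 7; the combination rule is inferred from the tabulated totals, and the names `total_error`, `CURRENT`, and `PREDICTION` are ours.

```python
import math

def total_error(component_errors):
    """Root-sum-of-squares combination of independent percent errors."""
    return math.sqrt(sum(e * e for e in component_errors))

# Error % by source with large input errors (Table 7).
CURRENT = {"System error": 2.97, "Sampling": 14.86, "Modeling": 2.97,
           "Measurement": 5.94, "Misclassify": 20.23}
PREDICTION = {"System error": 29.72, "Sampling": 14.86, "Modeling": 29.72,
              "Measurement": 5.94, "Misclassify": 20.23}

for label, budget in (("current estimate", CURRENT), ("10-year prediction", PREDICTION)):
    total = total_error(budget.values())
    # Share of the total squared error contributed by each source.
    shares = {name: round(100 * e * e / total ** 2, 1) for name, e in budget.items()}
    print(label, round(total, 2), shares)
```

Expressing each source as a share of the squared total makes the shift visible: misclassification dominates the current estimate, while system and modeling error dominate the 10-year prediction.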


Table 6.  Error-budget table of Shannon index with small input errors.

                                Current estimate        10-year prediction
Error sources   Input error %   Bias %      Error %     Bias %      Error %
System error    5               0           1.49        0           14.86
Sampling        10              0           2.97        0           2.97
Modeling        5               0           1.49        0           14.86
Measurement     10              0           2.97        0           2.97
Misclassify     10              -4.69       18.87       -4.69       18.87
Total           18.7            -4.69       19.45       -4.69       28.85

Table 7.  Error-budget table of Shannon index with large input errors.

                                Current estimate        10-year prediction
Error sources   Input error %   Bias %      Error %     Bias %      Error %
System error    10              0           2.97        0           29.72
Sampling        50              0           14.86       0           14.86
Modeling        10              0           2.97        0           29.72
Measurement     20              0           5.94        0           5.94
Misclassify     20              -18.42      20.23       -18.42      20.23
Total           59.16           -18.42      26.13       -18.42      49.31

3  Conclusions


The benefits of an error-budget model can be substantial. First, the error-budget model evaluates the statements made from the survey. Given all errors, the error-budget model can determine whether the statements are valid. A statement is valid only if it is within certain error limits; it provides little useful information if its errors fall outside the specified limits. Second, the error-budget model guides survey decisions. With error sensitivity analysis, all types of error sources can be tested to find their effect on the final statement, and survey effort can be directed at controlling the sources of error that matter most. In this manner, maximum accuracy can be obtained at minimum cost. Third, the error-budget model provides information for error correction. To correct errors, the sources of the errors must first be known, and errors from different sources may require different correction procedures. Using error decomposition, the major causes of errors can be determined.

Error-budget analysis of the plant population model yielded a number of possible sources of error. For the initial estimates of the Shannon index, the major error was from misclassification. Misclassification is mainly due to misidentified species, recording species with the wrong code, and mistyping species codes into the data set. The major measurement error stemmed from the way clumps were counted. Over the course of a 10-year prediction, the system error and modeling error compounded, causing a significant rise in the total error.

Uncertainty in near-term model predictions was largely determined by errors associated with data collection. These types of errors can be largely reduced by making consistent measurements and by an effective quality assurance/control program. Costs associated with reducing these sources of error should be minimal. However, as model prediction periods increase, modeling and system errors become the most important sources of uncertainty. These sources of error can be reduced only through an intimate knowledge of the ecology and the model. As more data are collected through the years, the potential exists for modeling error to be reduced.

The error-budget model presented in this report illustrates the potential of using error budgets to assist land managers. The error budget provides the user of the model with a means to assess management alternatives. The consequences of alternative data collection and quality control procedures on model predictions can be objectively assessed. Depending on the time period of concern, the error budget identifies the sources of error that most affect decision-making processes.

Based on the results of this study, it is recommended that uncertainty analysis tools such as error budgets be used more frequently in natural resources modeling efforts. Through the use of error budgets, interpretation of model results and subsequent management decisions can more accurately reflect our real understanding of the managed resources.